Importing Libraries
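The exact import list isn't shown in this outline; a typical set for this scaling + PCA + SVC workflow would be:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe for scripts
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
```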

Importing the Dataset

Checking for datatypes and missing values
Replacing missing (NaN) values with the median of the respective columns
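A minimal sketch of these two steps. The frame below is a fabricated stand-in for the vehicle dataset (column names are hypothetical); the real notebook loads its data from a file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the vehicle dataset, with missing values
df = pd.DataFrame({
    "compactness": [95.0, np.nan, 104.0, 85.0],
    "circularity": [48.0, 41.0, np.nan, 44.0],
    "class": ["van", "car", "bus", "car"],
})

print(df.dtypes)          # check column datatypes
print(df.isnull().sum())  # count missing values per column

# Fill NaNs in numeric columns with each column's median
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
print(int(df.isnull().sum().sum()))  # no missing values remain
```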

Data Analysis and Visualization

Scaling the data for easier boxplot comparison and for later use in PCA
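A sketch of z-score scaling with `StandardScaler`, on a small stand-in frame (the column names are illustrative). Each column ends up with mean 0 and unit variance, which puts all variables on a comparable boxplot scale and is the usual preparation for PCA.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative numeric frame standing in for the vehicle features
X = pd.DataFrame({"compactness": [95.0, 91.0, 104.0, 85.0],
                  "circularity": [48.0, 41.0, 50.0, 44.0]})

# z-score scaling: subtract the column mean, divide by the column std
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
print(X_scaled.mean().round(6))       # ~0 for every column
print(X_scaled.std(ddof=0).round(6))  # 1 for every column
```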
Studying distributions of each of the variables
Studying distributions of the different variables by categories of vehicles
We see significant overlap between the vehicle categories on most variables; with individual variables alone it may therefore be difficult to identify a vehicle's category
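One way to study a variable's distribution per category is a grouped boxplot. This sketch uses synthetic data (names hypothetical); overlapping boxes are the visual signature of the overlap noted above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in: one feature, three vehicle classes
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "compactness": rng.normal(0, 1, 90),
    "class": np.repeat(["car", "bus", "van"], 30),
})

# One box per vehicle class; heavily overlapping boxes mean this
# variable alone does not separate the classes well
df.boxplot(column="compactness", by="class")
fig_count = len(plt.get_fignums())
plt.close("all")
```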
Creating the correlation matrix of the independent variables
We see that quite a few of the independent variables are strongly correlated, which will lead to multicollinearity
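A sketch of building the correlation matrix and flagging strongly correlated pairs. The data is synthetic, with two columns made deliberately (anti-)correlated to mimic what the notebook observed; column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in features with deliberately correlated columns
rng = np.random.default_rng(1)
a = rng.normal(size=200)
X = pd.DataFrame({"scatter_ratio": a,
                  "elongatedness": -a + rng.normal(scale=0.1, size=200),
                  "circularity": rng.normal(size=200)})

corr = X.corr()
print(corr.round(2))

# Flag strongly correlated pairs (|r| > 0.8): evidence of multicollinearity
strong = [(i, j) for i in corr.columns for j in corr.columns
          if i < j and abs(corr.loc[i, j]) > 0.8]
print(strong)
```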
Generating a pairplot to check all variables at one go
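A pairplot can be produced with seaborn's `pairplot` or, as sketched here with synthetic data, pandas' built-in `scatter_matrix`: a grid of pairwise scatter plots with histograms on the diagonal, letting you eyeball all bivariate relationships at once.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for three of the vehicle features
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["compactness", "circularity", "elongatedness"])

# 3x3 grid: scatter plots off-diagonal, histograms on the diagonal
axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")
print(axes.shape)
plt.close("all")
```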

Building a classifier

Split Data into X variables and Target variables
Splitting into train and test
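The two splits above can be sketched as follows, on synthetic data (feature and target names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vehicle data
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(120, 4)),
                  columns=["f1", "f2", "f3", "f4"])
df["class"] = rng.choice(["car", "bus", "van"], size=120)

X = df.drop(columns="class")  # independent variables
y = df["class"]               # target variable

# Hold out 30% for testing; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)
```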
Training an SVC classifier
Model Score on Train set
Model score on Test set
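A sketch of fitting an SVC and comparing train and test accuracy. The data here is pure noise and the hyperparameters (`C=100`, `gamma=1.0`) are deliberately chosen to exaggerate overfitting for illustration; these are not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic noise data: 18 features, 4 classes, no real signal,
# so the model can only memorise the training set
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 18))
y = rng.integers(0, 4, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Large C and gamma force the RBF kernel to memorise training points
model = SVC(C=100, gamma=1.0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
# train accuracy far exceeds test accuracy: a classic overfitting gap
```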
We see a huge drop in accuracy from the train set to the test set, indicating that the model was overfit.
Checking other metrics for accuracy of predictions on the Test set
We see that almost everything was predicted as a car
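The "everything predicted as one class" pattern is invisible in a single accuracy number but obvious in a confusion matrix and per-class report. A sketch with hand-made labels reproducing that pattern:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative predictions where one class dominates the output,
# the pattern the notebook observed on the test set
y_true = np.array(["car", "bus", "van", "car", "bus", "van", "car", "car"])
y_pred = np.array(["car", "car", "car", "car", "car", "car", "car", "van"])

cm = confusion_matrix(y_true, y_pred, labels=["bus", "car", "van"])
print(cm)  # the "car" column absorbs almost every sample
print(classification_report(y_true, y_pred, zero_division=0))
```
Per-class recall for "bus" and "van" is zero here, which is the kind of failure a plain accuracy score hides.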

Dimensionality Reduction

Creating and printing the covariance matrix
Seven dimensions seem reasonable: they explain approximately 98% of the variation
Displaying the transformed data with seven variables
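A sketch of the covariance matrix, the explained-variance criterion, and the transform. The data is synthetic (built from a few latent factors so that a handful of components cross the 98% threshold); on the notebook's actual vehicle data the threshold is crossed at seven components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 18-feature data with strong linear dependence
rng = np.random.default_rng(5)
latent = rng.normal(size=(500, 5))      # 5 underlying factors
mixing = rng.normal(size=(5, 18))
X = latent @ mixing + rng.normal(scale=0.1, size=(500, 18))
X = StandardScaler().fit_transform(X)

# Covariance matrix of the scaled data (what PCA diagonalises)
cov = np.cov(X, rowvar=False)
print(cov.shape)

# Pick the smallest number of components explaining >= 98% of variance
pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.98) + 1)
print(n_components, round(cum_var[n_components - 1], 4))

# Transform the data down to the chosen number of components
X_reduced = PCA(n_components=n_components).fit_transform(X)
print(X_reduced.shape)
```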
Checking if multicollinearity has been removed.
We see almost no correlation between the new variables
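This is expected by construction: principal components are orthogonal, so their pairwise correlations are zero up to numerical precision. A sketch on synthetic data with two highly correlated raw features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Two highly correlated raw features plus one independent one
rng = np.random.default_rng(6)
a = rng.normal(size=300)
X = np.column_stack([a,
                     a + rng.normal(scale=0.05, size=300),
                     rng.normal(size=300)])

# After PCA the component scores are uncorrelated
X_pca = PCA(n_components=3).fit_transform(X)
corr_after = np.corrcoef(X_pca, rowvar=False)
print(np.round(corr_after, 3))  # ~identity matrix
```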

Fitting SVC on dimensionally reduced variables

We see a significant improvement in model performance
Overall we see a massive improvement in model performance due to PCA
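The reduced-dimension fit can be sketched as a scale-then-PCA-then-SVC pipeline. The data is synthetic (classes genuinely living in a low-dimensional subspace of 18 features), so the accuracies below are illustrative, not the notebook's numbers; the pipeline also ensures the test set is transformed with parameters learned only on the training set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic classes embedded in a 5-dimensional subspace of 18 features
rng = np.random.default_rng(7)
n = 400
y = rng.integers(0, 4, size=n)
centers = rng.normal(scale=3, size=(4, 5))
latent = centers[y] + rng.normal(size=(n, 5))
X = latent @ rng.normal(size=(5, 18))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Scale -> PCA -> SVC; fitting the whole chain avoids test-set leakage
model = make_pipeline(StandardScaler(), PCA(n_components=5), SVC())
model.fit(X_train, y_train)
print(f"train: {model.score(X_train, y_train):.2f}")
print(f"test:  {model.score(X_test, y_test):.2f}")
```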

Conclusion

  1. We saw that with 18 independent dimensions the model was badly overfit.
  2. We also saw high correlation between the independent variables, which led to multicollinearity.

Both of these issues are addressed when we reduce the dimensionality to seven using Principal Component Analysis.

  1. Although there is some information loss, we are still able to get a good fit for the SVC model.
  2. The model's performance on the test set is almost as good as on the training set.

These are the two key ways in which dimensionality reduction helped in this case.